#install.packages("ape")
#install.packages("phytools")
#install.packages("geiger")
#install.packages("corHMM")
library(ape)
library(phytools)
library(geiger)
library(corHMM)
Species covary because they share (some) evolutionary history: we expect closely related species to be more similar to each other than more distantly related species. Treating species as independent in statistical analyses can be misleading! Let see a practical example:
set.seed(42) ## just choose an integer
tree<-NULL
while(is.null(tree))
tree<-pbtree(n=100,b=1,d=0.6,extant.only=T)
## Warning:
## no extant tips, tree returned as NULL
## Warning:
## no extant tips, tree returned as NULL
plotTree(tree,ftype="off")
fit.olm <- lm(y~x)
plot(x,y,cex=1.5,pch=21,bg="grey")
#abline(fit.olm,lwd=2,lty="dashed",col="red")
summary(fit.olm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1847 -0.5539 0.0576 0.7228 3.2083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.73972 0.12231 -6.048 2.68e-08 ***
## x -0.18234 0.07078 -2.576 0.0115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.213 on 98 degrees of freedom
## Multiple R-squared: 0.06343, Adjusted R-squared: 0.05387
## F-statistic: 6.637 on 1 and 98 DF, p-value: 0.01148
These data were simulated in the absence of any evolutionary correlation between x & y! It’s easy for phylogeny to induce a type I error, since closely related species have similar phenotypes by descent, and can not be considered independent data points about the evolutionary process.
phylomorphospace(tree,cbind(x,y),label="off",node.size=c(0,0))
points(x,y,pch=21,bg="grey",cex=1.5)
#abline(fit.olm,lwd=2,lty="dashed",col="red")
Phylogenetic Independent Contrasts (PIC) is a method for accounting for phylogenetic relatedness when making interspecific trait comparisons. Now let’s see if by using Felsenstein’s (1985!) algorithm we can test this type I error.
x_PIC<-pic(x,tree)
y_PIC<-pic(y,tree)
fit.pic<-lm(y_PIC~x_PIC+0)
fit.pic
##
## Call:
## lm(formula = y_PIC ~ x_PIC + 0)
##
## Coefficients:
## x_PIC
## 0.1378
plot(x_PIC,y_PIC,cex=1.5,pch=21,bg="grey",
xlab="phylogenetically independent contrasts for x",
ylab="phylogenetically independent contrasts for y")
#abline(fit.pic,lwd=2,lty="dashed",col="red")
## add axes
abline(h=0,lty="dotted",col="grey")
abline(v=0,lty="dotted",col="grey")
summary(fit.pic)
##
## Call:
## lm(formula = y_PIC ~ x_PIC + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.09669 -0.57829 -0.06354 0.59042 2.18316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x_PIC 0.13779 0.09176 1.502 0.136
##
## Residual standard error: 0.8765 on 98 degrees of freedom
## Multiple R-squared: 0.02249, Adjusted R-squared: 0.01251
## F-statistic: 2.255 on 1 and 98 DF, p-value: 0.1364
We need to take the phylogeny into account … our trait data were simulated without an evolutionary relationship between x & y. The relationship that we retrieved iwith OLS is a type I error driven by the phylogenetic structure of our data. Type II errors are possible as well!
The Enterobacterales order of Gammaproteobacteria shifts from free-living to endosymbiont are considered to have happened independently numerous times. The phylogenetic placement of endosymbiotic lineages is highly uncertain: this is due to their peculiar sequence evolution, consisting of a non-homogeneous (amino acids frequencies do not remain constant, i.e. strong bias towards AT-rich base composition in symbionts) and non-stationary (evolutionary rates do not remain constant, i.e. accelerated evolutionary rates in symbionts) process. Such idiosyncrasies violate the assumption of traditional phylogenetic models leading to systematic error and artifactual clustering of species.
Let’s see how genes AT content maps across two phylogenies:
Here is a “simple one” (it appears to artificially cluster species based on their AT content):
tree_simp<-read.tree("tree_enterobacterales_simp.nwk")
tree_simp <- midpoint.root(tree_simp)
tree_simp <- chronos(tree_simp, model = "clock", lambda = 7)
##
## Setting initial dates...
## Fitting in progress... get a first set of estimates
## (Penalised) log-lik = -66.15454
## Optimising rates... dates... -66.15454
##
## log-Lik = -66.15454
## PHIIC = 550.31
data<-read.csv("traits_enterobacterales.csv",row.names=1)
AT_content<-setNames(as.numeric(data[,"mean_CDS_at"]),rownames(data))
fit<-fastAnc(tree_simp, AT_content,vars=TRUE,CI=TRUE)
simp_ASR<-contMap(tree_simp,AT_content,plot=FALSE)
plot(simp_ASR, ftype="off", type="fan")
Here is a “complex one” (looks much better in splitting that AT rich clade from the previous tree):
tree_comp<-read.tree("tree_enterobacterales_comp.nwk")
tree_comp <- midpoint.root(tree_comp)
tree_comp <- chronos(tree_comp, model = "clock", lambda = 7)
##
## Setting initial dates...
## Fitting in progress... get a first set of estimates
## Warning in nlminb(start.para, f, g, control = opt.ctrl, lower = LOW, upper =
## UP): NA/NaN function evaluation
## (Penalised) log-lik = -3541.173
## Optimising rates... dates... -3541.173
## Optimising rates... dates... -77.48034
## Optimising rates... dates... -77.26213
## Optimising rates... dates... -77.17727
## Optimising rates... dates... -77.10403
## Optimising rates... dates... -77.04355
## Optimising rates... dates... -76.99979
## Optimising rates... dates... -76.97363
## Optimising rates... dates... -76.95967
## Optimising rates... dates... -76.95254
## Optimising rates... dates... -76.94897
## Optimising rates... dates... -76.9472
## Optimising rates... dates... -76.94633
## Optimising rates... dates... -76.9459
## Optimising rates... dates... -76.94569
## Optimising rates... dates... -76.94559
## Optimising rates... dates... -76.94554
## Optimising rates... dates... -76.94552
## Optimising rates... dates... -76.94551
## Optimising rates... dates... -76.9455
## Warning: Maximum number of dual iterations reached.
##
## log-Lik = -76.9455
## PHIIC = 571.89
data<-read.csv("traits_enterobacterales.csv",row.names=1)
AT_content<-setNames(as.numeric(data[,"mean_CDS_at"]),rownames(data))
fit<-fastAnc(tree_comp, AT_content,vars=TRUE,CI=TRUE)
comp_ASR<-contMap(tree_comp,AT_content,plot=FALSE)
plot(comp_ASR, ftype="off", type="fan")
Let’s just act as we could forget what we learned earlier and consider all datapoints (species) unrelated:
fit<-lm(data$mean_CDS_at~data$mean_CDS_ln)
summary(fit)
##
## Call:
## lm(formula = data$mean_CDS_at ~ data$mean_CDS_ln)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.947 -9.009 -4.443 7.564 23.845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.079510 7.370279 4.624 6.61e-06 ***
## data$mean_CDS_ln 0.021855 0.007926 2.757 0.00635 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.12 on 208 degrees of freedom
## Multiple R-squared: 0.03526, Adjusted R-squared: 0.03062
## F-statistic: 7.603 on 1 and 208 DF, p-value: 0.006347
Let’s use the “simple” tree to correct with Feselstain’s PIC:
mean_CDS_at_PIC<-pic(data$mean_CDS_at,tree_simp)
mean_CDS_ln_PIC<-pic(data$mean_CDS_ln,tree_simp)
fit.pic<-lm(mean_CDS_at_PIC~mean_CDS_ln_PIC)
summary(fit.pic)
##
## Call:
## lm(formula = mean_CDS_at_PIC ~ mean_CDS_ln_PIC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -627.02 -13.63 5.35 22.98 208.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.828712 4.704883 -1.451 0.148
## mean_CDS_ln_PIC -0.015135 0.009861 -1.535 0.126
##
## Residual standard error: 67.65 on 207 degrees of freedom
## Multiple R-squared: 0.01125, Adjusted R-squared: 0.006476
## F-statistic: 2.356 on 1 and 207 DF, p-value: 0.1264
Let’s use the “complex” tree to correct with Feselstain’s PIC:
mean_CDS_at_PIC<-pic(data$mean_CDS_at,tree_comp)
mean_CDS_ln_PIC<-pic(data$mean_CDS_ln,tree_comp)
fit.pic<-lm(mean_CDS_at_PIC~mean_CDS_ln_PIC)
summary(fit.pic)
##
## Call:
## lm(formula = mean_CDS_at_PIC ~ mean_CDS_ln_PIC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -224.87 -14.30 4.95 17.89 319.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.785420 3.987402 -1.702 0.0903 .
## mean_CDS_ln_PIC 0.008694 0.009319 0.933 0.3519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.32 on 207 degrees of freedom
## Multiple R-squared: 0.004187, Adjusted R-squared: -0.0006236
## F-statistic: 0.8704 on 1 and 207 DF, p-value: 0.3519
Take-home message: some phylogeny is better than no phylogeny :-)
Dollo’s law of irreversibility (also known as Dollo’s law and Dollo’s principle), proposed in 1893 by Belgian paleontologist Louis Dollo states that, “an organism never returns exactly to a former state, even if it finds itself placed in conditions of existence identical to those in which it has previously lived … it always keeps some trace of the intermediate stages through which it has passed.” The statement is often misinterpreted as claiming that evolution is not reversible, or that lost structures and organs cannot reappear in the same form.
In 2003 a paper published in Nature by Whiting et al. suggested that wings may have reappeared several times within the ancestrally wingless phasmid, possibly on many occasions. We have inferred a comprehensive timetree of Phasmatodea and coded presence and absence of wings (as binary states: 2=absence and 1=presence). Missing data are coded as NA!
tree<-read.tree("tree_phasmids.nwk")
data<-read.csv("traits_phasmids.csv",row.names=1)
chk<-name.check(tree,data)
tree.pruned<-drop.tip(tree,chk$tree_not_data)
data.pruned<-data[!(rownames(data)%in%chk$data_not_tree),,drop=FALSE]
wings<-setNames(as.factor(data.pruned[,"wings"]),rownames(data.pruned))
With the idea that “all model are wrong, but some are usefull” we have tested some of them to see which one fits best to phasmid phylogeny.
ER<-fitDiscrete(tree.pruned,data.pruned,model="ER")
plot(ER)
ARD<-fitDiscrete(tree.pruned,data.pruned,model="ARD")
plot(ARD)
This is a loss-only model, with a custom matrix (it doesn’t make a lot of biological sense …):
model.loss<-matrix(c(0,1,0,0),2,2)
rownames(model.loss)<-colnames(model.loss)<-levels(wings)
loss_only<-fitDiscrete(tree.pruned,data.pruned,model=model.loss)
plot(loss_only)
This is a gain-only model, with a custom matrix (this is the biological hypothesis taht we are testing against …):
model.gain<-matrix(c(0,0,1,0),2,2)
rownames(model.gain)<-colnames(model.gain)<-levels(wings)
gain_only<-fitDiscrete(tree.pruned,data.pruned,model=model.gain)
plot(gain_only)
Now we can compare the four different models using both lnL and Akaike weights:
lnL<-setNames(
c(gain_only$opt$lnL,loss_only$opt$lnL,
ER$opt$lnL,ARD$opt$lnL),
c("Loss-only","Gain-only","Equal Rates","All Rates Different"))
lnL
## Loss-only Gain-only Equal Rates All Rates Different
## -147.6793 -147.4727 -135.6834 -134.2684
aicc<-setNames(
c(gain_only$opt$aicc,loss_only$opt$aicc,
ER$opt$aicc,ARD$opt$aicc),
c("Loss-only","Gain-only","Equal Rates","All Rates Different"))
aicw(aicc)
## fit delta w
## Loss-only 297.3734 24.7918213 2.475827e-06
## Gain-only 296.9603 24.3787354 3.043833e-06
## Equal Rates 273.3816 0.7999773 4.013129e-01
## All Rates Different 272.5816 0.0000000 5.986816e-01
The best model is All Rates Different! The model which better describes our data (trait of extant species and phylogeny) allows for reversions!
We then carried out an ancestral state reconstruction throughout the phylogeny, using the best-fit model (ARD):
fitARD<-ace(wings,tree.pruned,model="ARD",type="discrete")
no_br_tree <- compute.brlen(tree.pruned, power=0.4)
plot.phylo(no_br_tree,type="fan",lwd=1, edge.width = 1, label.offset = 2,show.tip.label = FALSE,no.margin = TRUE)
nodelabels(node=1:tree.pruned$Nnode+Ntip(tree.pruned),pie=fitARD$lik.anc,cex=0.2)
tiplabels(pie=to.matrix(wings[tree.pruned$tip.label],levels(wings)),cex=0.2)
fitARD
##
## Ancestral Character Estimation
##
## Call: ace(x = wings, phy = tree.pruned, type = "discrete", model = "ARD")
##
## Log-likelihood: -133.9502
##
## Rate index matrix:
## 1 2
## 1 . 2
## 2 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 1.7624 0.3677
## 2 3.3200 0.5502
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## 1 2
## 0.1650584 0.8349416
Plenty of reversions are observed and the probability of the Most Recent Common Ancestor (MRCA) of all phasmids beeing wingless is around 83% :-O.